PREFACE


This  report  describes  activities  performed  in  support  of  USAF  personnel  selection  and 
classification  (AF/AIPF),  Work  Unit  2313HC58.  The  authors  thank  Mr  Ken  Schwartz 
(AETC/DPSF)  and  the  AFPC  Human  Resources  Data  Bank  (HRRD)  for  support  in  the 
development  of  the  database  used  in  this  study. 


INTRODUCTION 


The Air Force Officer Qualifying Test (AFOQT) is the latest in a line of aptitude and
achievement tests that traces its beginnings to the World War II Army Aviation Psychology
Program (AAPP; Davis, 1947; Flanagan, 1948). Beginning in the early 1940s, many graduate
students were offered commissions* to serve in the Army and conduct research on military and
aviation-related topics. The topics covered many aspects of assessment and classification into
military specialties. The history of the program and its research is documented in a series of
books published by the Government Printing Office in the late 1940s. The list of contributors
reads like a who’s who of psychometrics for the next four decades. Included are John Flanagan,
Robert Thorndike, Lloyd Humphreys, Arthur Melton, Frederick B. Davis, and Philip DuBois
(Flanagan, 1948).

Based on the AAPP results, the Army Air Corps and later the Air Force instituted
apparatus-based tests such as the multidimensional pursuit test, drift correction test, stick and
rudder manipulation, and complex coordination (Melton, 1947). There was a single centralized
testing site where all apparatuses were maintained in careful calibration. When apparatus testing
was decentralized to many sites, it became nearly impossible to maintain administration
consistency and the standards of calibration. The system became unworkable and was
discontinued. It was proposed that paper-and-pencil tests replace the apparatus tests.

The immediate operational precursor of the AFOQT was the Aviation Cadet Qualifying
Test (ACQT), which consisted of 13 subtests with a total of 300 items. Some of the content was
to be expected, such as “General Mathematics” and “Mechanical Principles.” Some was
unexpected, such as “Current Affairs” and “Biographical Data,” the latter heavily weighted to
hunting and outdoor activities. The first AFOQT, Form A, was implemented in 1953 (Rogers,
Roach, & Short, 1986; Valentine & Creager, 1961). Over the years, the test has gone through
many versions and numerous modifications to its content. AFOQT Form A was issued with 665
items in 15 subtests. AFOQT Form B is notable as it was used for selection of the first class of
the newly completed US Air Force Academy in Colorado Springs, Colorado. This form had 835
items in 17 subtests. Unique subtests included “Interests” and “Aerial Landmarks.” Form C was
reduced to 645 items, while Form D, in 1957, marked the apogee with 855 items and separate
pilot and officer biographical inventories. Through Form G the AFOQT had close to 800 items.
During the 1960s and 1970s there were small changes in the battery content and nomenclature,
and forms were called AFOQT-64, AFOQT-66, and AFOQT-68. Then there was a marked decline
in the number of items on Forms H through N. In 1978, Form N had 606 items in 18 subtests.
With the implementation of Form O, four subtests (Background for Current Events, Tools, Aerial
Landmarks, and Pilot Biographic and Attitude Scale) were removed from its immediate
predecessor, Form N, and two new subtests were added (Aviation Information and Hidden
Figures). AFOQT Forms O, P, Q, and R had 16 subtests with 380 items. Form R was never
implemented but was eventually revised and published in 2005 as Form S with 250 items in 11
subtests.

* Frederick B. Davis told the story to my (Ree) class of graduate students in 1972 of how he was
offered a commission or, if he did not accept, he could take his chances with the draft. He was
commissioned in the US Army shortly thereafter and participated in the AAPP, first in Santa
Ana, California, then in San Antonio, Texas.

The Air Force Officer Qualifying Test is used to award US Air Force (USAF) Reserve
Officer Training Corps (ROTC) scholarships and to qualify applicants for officer commissioning
through the ROTC and Officer Training School (OTS) programs. The AFOQT also is used to
qualify applicants for aircrew training as pilots, combat system operators (formerly navigators),
and air battle managers if they pass other educational, fitness, medical, moral, and physical
requirements. For operational use, the subtests are combined into five overlapping composites
(see Table 1). The Verbal, Quantitative, and Academic Aptitude composites are used to qualify
applicants for ROTC and OTS officer commissioning programs. The Pilot and
Navigator/Technical composites are used to qualify applicants for aircrew training. The AFOQT
has been validated against officer training performance (Roberts & Skinner, 1996), several
aircrew training performance criteria including passing/failing training, training grades, and class
rank (Carretta, in press; Carretta & Ree, 2003; Olea & Ree, 1994), and several non-aviation
officer jobs (Arth, 1986; Arth & Skinner, 1986; Finegold & Rogers, 1985; Hartke & Short,
1988).
Table 1. Composition of AFOQT Form S Aptitude Composites

                                                        Academic         Navigator/
Subtest                           Verbal  Quantitative  Aptitude  Pilot  Technical

Verbal Analogies (VA)               X                      X                 X
Arithmetic Reasoning (AR)                      X           X        X        X
Word Knowledge (WK)                 X                      X
Math Knowledge (MK)                            X           X        X        X
Instrument Comprehension (IC)                                       X
Block Counting (BC)                                                          X
Table Reading (TR)                                                  X        X
Aviation Information (AI)                                           X
Rotated Blocks (RB)
General Science (GS)                                                         X
Hidden Figures (HF)


Note.  Although  RB  and  HF  were  retained  in  AFOQT  Form  S,  they  do  not  contribute  to  any  of 
the  operational  composite  scores. 


From 1980 through 2005, the AFOQT (Forms O, P, and Q) consisted of the same 16
subtests and each form was equated to the same normative score metric. Planned implementation
of  Form  R  was  suspended  as  AFOQT  program  managers  initiated  a  study  to  evaluate  methods  to 
reduce  test  administration  time  without  adversely  affecting  its  effectiveness.  The  goal  was  to 
determine  the  minimum  test  length  or  composition  that  maintained  the  current  AFOQT 
psychometric  characteristics.  Successful  achievement  of  these  psychometric  objectives  would 
reduce  the  administration  burden  and  examinee  fatigue,  and  possibly  make  time  available  for 
new  subtests  with  new  content.  Analyses  indicated  that  five  subtests  could  be  removed  while 
maintaining  cognitive/knowledge  content  areas,  reliability,  and  predictive  validity,  and  avoiding 
an  increase  in  adverse  impact.  Form  R  was  revised  and  implemented  as  Form  S  in  2005.  The 
administration time had been reduced from 4.5 to 3 hours with the removal of five subtests:
Reading  Comprehension,  Data  Interpretation,  Mechanical  Comprehension,  Electrical  Maze,  and 
Scale  Reading. 
Form S was implemented without an empirical evaluation of its factor structure in a
sample of applicants. Due to the substantial changes from the earlier 16-subtest forms, the
purpose of this study was to examine the latent factor structure of the 11-subtest AFOQT Form S
and compare it to that of previous forms. In addition, the measurement equivalence of Form S
was compared across gender and racial/ethnic groups. Measurement equivalence is increasingly
important as the Air Force becomes more diverse.


Latent  Structure  of  Earlier  AFOQT  Forms  and  Comparison  to  Form  S 

Carretta and Ree (1996) analyzed data from a sample of 3,000 applicants for Air Force
commissions who had taken the 16 subtest AFOQT. Model 1 in their study corresponded to the
operational composites in use at that time (but excluded Academic Aptitude because it was
linearly dependent on the Verbal and Quantitative composites). This model was found to have a
relatively poor fit to the data. Carretta and Ree’s Model 2 was based on Skinner and Ree’s
(1987) exploratory factor analysis, which found verbal, math, spatial, aircrew, and perceptual
speed factors. This model, with factors constrained to be orthogonal, also had a poor fit.

Carretta  and  Ree’s  (1996)  Model  5  consisted  of  Model  2  augmented  by  a  general  factor, 
psychometric  g,  which  enabled  the  model  to  account  for  correlations  of  subtests  loading  on 
different  first  order  factors.  This  model  provided  an  excellent  fit  to  the  data,  with  a  root  mean 
square  error  of  approximation  (RMSEA)  of  .071,  a  comparative  fit  index  (CFI)  of  .957,  and  an 
average  absolute  standardized  residual  of  .027.  Because  a  model  with  several  orthogonal  content 
factors  (i.e.,  verbal,  math,  etc.)  and  a  single  general  factor  is  a  nested  submodel  of  a  first  order 
factor  model  with  correlated  factors  (Yung,  Thissen,  &  McLeod,  1999),  Skinner  and  Ree’s 
(1987)  five  factors  (Model  2)  would  be  expected  to  provide  a  good  description  of  the  AFOQT 
data  if  they  were  allowed  to  be  oblique. 

To  compare  the  latent  structure  of  the  new  Form  S  to  earlier  forms,  two  types  of 
correlated  factor  models  were  fitted  to  the  data.  First,  we  fit  models  corresponding  to  the  current 
AFOQT  composites.  This  provides  information  about  the  degree  to  which  the  composites  are 
aligned  with  the  latent  factors  underlying  the  test  battery.  Second,  we  fit  models  based  on 
Skinner  and  Ree’s  (1987)  substantive  factors.  Because  Carretta  and  Ree  (1996)  found  these 
factors  (with  a  general  factor)  to  provide  the  best  description  of  the  16  subtest  AFOQT,  we 
expected  models  based  on  this  framework  to  fit  well. 
We also fit bifactor models to the AFOQT subtests. Although Carretta and Ree (1996)
cited Schmid and Leiman (1957), their Model 5 is more closely related to a bifactor model than a
Schmid-Leiman higher-order factor model. A bifactor model allows observed variables to load
directly on a single general factor and a specific factor. For example, the Arithmetic Reasoning
subtest would be expected to load on the general factor, a mathematical reasoning specific factor,
and an error factor. Carretta and Ree’s Model 5 has a bifactor structure with a few additional
cross-loadings (e.g., the Block Counting subtest loaded on spatial and perceptual speed factors in
addition to the general factor). A Schmid-Leiman higher-order model, in contrast, is more
restrictive than the bifactor model because it has additional proportionality constraints on factor
loadings (see Yung et al., 1999, p. 115).
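
The proportionality constraint can be illustrated numerically. The following is a minimal sketch in Python with NumPy, assuming a hypothetical higher-order model with four observed variables, two first-order factors, and one second-order factor; all loading values are invented for illustration and are not AFOQT estimates. It applies the Schmid-Leiman transformation and shows that, within each first-order factor, the ratio of general to specific loadings is the same for every variable, a restriction that a bifactor model does not impose.

```python
import numpy as np

# Hypothetical standardized loadings of four observed variables on two
# first-order factors (illustrative values, not AFOQT estimates).
first_order = np.array([
    [0.80, 0.00],
    [0.60, 0.00],
    [0.00, 0.70],
    [0.00, 0.50],
])
# Hypothetical loadings of the first-order factors on one second-order factor.
second_order = np.array([0.90, 0.60])


def schmid_leiman(first_order, second_order):
    """Return general and specific loadings implied by a higher-order model."""
    general = first_order @ second_order                       # shape (4,)
    specific = first_order * np.sqrt(1.0 - second_order ** 2)  # shape (4, 2)
    return general, specific


general, specific = schmid_leiman(first_order, second_order)

# Ratio of general to specific loading for each variable on its own factor.
ratio_factor1 = general[:2] / specific[:2, 0]
ratio_factor2 = general[2:] / specific[2:, 1]
print(ratio_factor1)  # the two entries are equal
print(ratio_factor2)  # the two entries are equal
```

In a bifactor model the general and specific loadings are estimated freely, so these ratios may differ from variable to variable.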

Form  S  of  the  AFOQT  presents  a  challenge  to  confirmatory  factor  analysis  because  two 
of  its  composites  and  two  of  its  factors  are  expected  to  have  nonzero  loadings  for  only  two 
subtests.  It  is  well  known  that  statistical  estimation  of  factor  loadings  requires  at  least  three 
nonzero  loadings.  In  this  situation,  a  trick  is  sometimes  used:  one  factor  loading  is  fixed  at  a 
nonzero  constant  (e.g.,  one),  one  factor  loading  is  estimated,  and  then  the  variance  of  that  factor 
is  treated  as  a  free  parameter  to  be  estimated.  It  turns  out  that  this  approach  does  not  actually 
estimate  the  factor  loadings;  it  only  estimates  the  ratio  of  the  second  loading  to  the  first. 
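
One way to see why only the ratio is recovered is to compare the covariance matrices implied by two different sets of parameter values. The sketch below, in Python with NumPy, assumes a hypothetical model with one two-indicator factor and one three-indicator factor; every number is invented for illustration. Multiplying the loadings of the two-indicator factor by a constant while shrinking that factor's variance and covariance to compensate leaves the implied covariance matrix unchanged, so the data cannot distinguish the two solutions, whereas the ratio of the second loading to the first is the same in both.

```python
import numpy as np


def implied_cov(loadings, phi, uniq):
    """Model-implied covariance matrix: Lambda Phi Lambda' + Theta."""
    return loadings @ phi @ loadings.T + np.diag(uniq)


# Hypothetical model: factor A has two indicators, factor B has three.
loadings = np.array([
    [1.0, 0.0],   # first loading of factor A fixed at one
    [0.8, 0.0],
    [0.0, 1.0],
    [0.0, 0.9],
    [0.0, 0.7],
])
phi = np.array([[0.5, 0.3],
                [0.3, 0.6]])                 # factor covariance matrix
uniq = np.array([0.5, 0.6, 0.4, 0.5, 0.6])   # unique variances

# Rescale factor A: loadings times c, variance divided by c squared,
# covariance with factor B divided by c.
c = 2.0
scale = np.diag([c, 1.0])
scale_inv = np.diag([1.0 / c, 1.0])
loadings_rescaled = loadings @ scale
phi_rescaled = scale_inv @ phi @ scale_inv

sigma = implied_cov(loadings, phi, uniq)
sigma_rescaled = implied_cov(loadings_rescaled, phi_rescaled, uniq)

same_fit = np.allclose(sigma, sigma_rescaled)
ratio_before = loadings[1, 0] / loadings[0, 0]
ratio_after = loadings_rescaled[1, 0] / loadings_rescaled[0, 0]
print(same_fit, ratio_before, ratio_after)
```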

To  better  understand  the  latent  structure  of  Form  S,  we  also  analyzed  multi-item 
composites.  We  took  this  approach  because  the  large  number  of  items  precluded  an  item-level 
factor  analysis.  Thus,  for  each  subtest,  mutually  exclusive  and  exhaustive  sets  of  items  were  used 
to  form  multi-item  composites  (called  “item  parcels”  by  Dorans  &  Lawrence,  1987)  and  then  the 
composites  were  factor  analyzed.  For  example,  five  composites  were  formed  for  the  Verbal 
Analogies  subtest  and  four  composites  were  used  for  the  Instrument  Comprehension  subtest. 
Because  there  were  five  parcels  for  the  Arithmetic  Reasoning  and  Math  Knowledge  subtests, 
factor  loadings  for  10  observed  variables  could  be  estimated  for  the  mathematical  reasoning 
factor  and  hence,  factor  loadings  were  statistically  identified.  Moreover,  the  sampling 
distribution  of  parcels  more  closely  approximates  the  distribution  assumed  by  linear  factor 
analysis  models  (i.e.,  multivariate  normality)  than  the  sampling  distribution  of  individual  items. 
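
The parceling step can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' procedure: the rule used to assign items to parcels is not described here, so the sketch assumes items are dealt to parcels in rotation, and the response data are simulated. The function name `make_parcels` and the 25-item, 5-parcel layout are hypothetical.

```python
import numpy as np


def make_parcels(item_scores, n_parcels):
    """Sum items into mutually exclusive and exhaustive parcels.

    Assumed rule: item 0 goes to parcel 0, item 1 to parcel 1, and so on,
    wrapping around, so every item lands in exactly one parcel.
    """
    n_people = item_scores.shape[0]
    parcels = np.zeros((n_people, n_parcels))
    for p in range(n_parcels):
        parcels[:, p] = item_scores[:, p::n_parcels].sum(axis=1)
    return parcels


# Simulated right/wrong (1/0) responses: 200 examinees, 25 items.
rng = np.random.default_rng(0)
items = rng.integers(0, 2, size=(200, 25))

parcels = make_parcels(items, n_parcels=5)
print(parcels.shape)
```

Because the parcels partition the items, the parcel scores for each examinee sum to that examinee's total subtest score.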
Measurement Equivalence

The question addressed by measurement equivalence analyses is whether individuals with
equal standings on the underlying trait assessed by a test, but sampled from different groups,
have equal expected observed test scores (Drasgow, 1984). For example, do individuals with
equal quantitative ability, but sampled from different groups, have equal expected scores on the
Arithmetic Reasoning subtest? To examine measurement equivalence, mean and covariance
structure (MACS; Sörbom, 1974) analysis was used. Here, the traditional factor analysis model,

$$x = \Lambda\xi + \delta,$$

is augmented to

$$x = \tau + \Lambda\xi + \delta,$$

where $x$ is the vector of observed variables, $\Lambda$ is the matrix of factor loadings, $\xi$ is
the vector of factor scores, and $\delta$ is a vector of errors. The difference between these two
equations is the vector $\tau$. Ordinarily, a correlation or covariance matrix is input to factor
analysis; to this, MACS adds a vector of means of the observed variables, and the vector $\tau$
contains the intercepts of the linear regressions of the observed variables $x$ on the latent
variables $\xi$.

To study measurement invariance across groups $g$, the MACS model is written as

$$x^{(g)} = \tau^{(g)} + \Lambda^{(g)}\xi^{(g)} + \delta^{(g)},$$

where the superscript $g$ is used to indicate the group, $g = 1, \ldots, G$. Discussions of
measurement invariance in the context of factor analysis (e.g., Ployhart & Oswald, 2004;
Vandenberg & Lance, 2000) frequently mention configural, metric, and scalar invariance.
Configural invariance is obtained when the pattern of fixed (at zero) and estimated factor
loadings is identical across groups. Here the factor loadings may vary across groups, but each
observed variable loads on the same factor(s) for all groups.

Metric invariance is a more restrictive condition than configural invariance. It imposes
the constraint that the factor loading matrix, $\Lambda^{(g)}$, is identical across all the groups.
Provided that configural invariance holds, metric invariance can be examined by the change in
chi-square across configural and metric models: the metric invariance model is a nested submodel
of the configural invariance model. Rather than a statistical test of the change in chi-square,
ordinarily researchers examine the change in goodness of fit measures such as the RMSEA and
the CFI.
Finally, scalar invariance is even more restrictive than metric invariance. Scalar
invariance can be examined provided that metric invariance holds. Here, the constraint of
invariant thresholds $\tau^{(g)}$ is added to invariant factor loadings $\Lambda^{(g)}$. Again,
goodness of fit statistics should be examined to assess the extent to which adding this constraint
degrades fit.

When scalar invariance is obtained, individuals with the same standings on the latent
traits, but sampled from different groups $g$, have the same expected observed score. In the
language of item response theory (IRT), there is no differential item or test functioning. This is a
very important property because it means that no group is disadvantaged by the test: one’s
underlying abilities $\xi$ are transformed to observed scores in the same way for all groups.

It is important to note the distinction between measurement invariance and impact. In a
MACS model, measurement invariance holds when $\tau^{(g)}$ and $\Lambda^{(g)}$ are invariant
across groups. However, it is possible for groups to differ in their mean level of ability. For
example, one group may have higher or lower means on the latent traits. We use the vector
$\kappa^{(g)}$ to denote the factor means for group $g$. Thus, $\kappa^{(g)}$ can vary across
groups because, without random assignment of people to groups, there is no reason to expect
groups to be equally skilled in the characteristics assessed by the subtests. Then invariant
$\tau^{(g)}$ and $\Lambda^{(g)}$ mean that observed differences faithfully reflect the underlying
differences on the factors.
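
The distinction between invariance and impact can be made concrete with a small numerical sketch in Python with NumPy. It assumes a hypothetical one-factor model with three observed variables; the intercepts, loadings, and factor means (called `kappa` in the code) are invented for illustration. With intercepts and loadings shared across groups, two people with the same factor score have the same expected observed scores whatever their group, while a difference in factor means passes through to the observed means in proportion to the loadings.

```python
import numpy as np

# Hypothetical parameters shared by both groups (scalar invariance holds).
tau = np.array([10.0, 12.0, 8.0])       # intercepts
lam = np.array([[2.0], [1.5], [1.0]])   # loadings on a single factor

# Hypothetical factor means: the groups differ in mean ability (impact).
kappa = {"group_a": np.array([0.0]), "group_b": np.array([0.5])}


def expected_scores(tau, lam, xi):
    """Expected observed scores for factor score(s) xi: tau + Lambda xi."""
    return tau + lam @ xi


# Same factor score gives the same expected scores in either group.
same_ability = expected_scores(tau, lam, np.array([1.0]))

# Observed group means differ only through the factor means.
group_means = {g: expected_scores(tau, lam, k) for g, k in kappa.items()}
observed_gap = group_means["group_b"] - group_means["group_a"]
latent_gap = lam @ (kappa["group_b"] - kappa["group_a"])
print(same_ability, observed_gap, latent_gap)
```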

In sum, we fit a variety of factor models to the AFOQT subtests and to multi-item
composites formed from the subtests to examine the latent structure of this test battery. Then we
examined measurement invariance across male/female and White/African
American/Hispanic/Other groups to assess whether there was any evidence of differential item or
test functioning.
METHODS


Sample 

The data consisted of the responses of 12,511 applicants for USAF officer
commissioning who were administered Form S1 of the AFOQT between 2005 and 2007. Mean
age at time of testing was 22.4 years. The sample included 9,424 men (75%) and 2,978 women;
66% were White, and more than 99% had completed at least a high school degree. In addition
to  qualification  on  the  AFOQT,  officer  commissioning  and  aircrew  training  applicants  met 
various  academic  (e.g.,  college  degree),  fitness  (e.g.,  physical  fitness  test),  moral  (e.g.,  legal 
issues),  medical  (e.g.,  physical  exam),  and  physical  (e.g.,  weight)  standards. 

To  assess  equivalence  across  race  and  ethnicity,  the  sample  was  divided  into  four  groups 
including  White,  African  American,  Hispanic,  and  Other.  Because  respondents  were  asked  to 
indicate  all  of  the  races  that  applied  to  them,  individuals  marking  more  than  one  race  were 
excluded  from  the  analyses.  This  exclusion  criterion  resulted  in  samples  of  8,296  Whites,  1,181 
African  Americans,  738  Hispanics,  and  728  Others. 


Measures 

AFOQT Form S consists of 11 cognitive subtests that are combined into five composites.
Personnel  decisions  including  qualification  for  officer  commissioning  programs  and  aircrew 
training  are  made,  in  part,  on  the  basis  of  the  composites. 

Brief  descriptions  of  the  AFOQT  subtests  grouped  by  content  are  presented  below. 

Verbal  subtests.  Verbal  Analogies  (VA)  provides  a  measure  of  the  ability  to  reason  and 
determine  relationships  between  words.  Word  Knowledge  (WK)  assesses  verbal  comprehension 
involving  the  ability  to  understand  written  language  through  the  use  of  synonyms. 

Quantitative  subtests.  Arithmetic  Reasoning  (AR)  measures  the  ability  to  understand 
arithmetic  relations  expressed  as  word  problems.  Math  Knowledge  (MK)  provides  a  measure  of 
the  ability  to  use  mathematical  terms,  formulas,  and  relations. 

Spatial  subtests.  Block  Counting  (BC)  measures  spatial  ability  through  the  analysis  of 
three-dimensional  representations  of  a  set  of  blocks.  Rotated  Blocks  (RB)  assesses  the  ability  to 
visualize  and  mentally  manipulate  objects.  Hidden  Figures  (HF)  measures  the  ability  to  see  a 
simple  figure  embedded  in  a  complex  drawing. 
Aircrew subtests. Instrument Comprehension (IC) assesses the ability to determine the
attitude of an aircraft from illustrations of flight instruments. Aviation Information (AI) measures
knowledge of general aviation terms, concepts, and principles. General Science (GS) provides a
measure of knowledge and understanding of scientific terms, concepts, instruments, and
principles.

Perceptual  speed  subtest.  Table  Reading  (TR)  assesses  the  ability  to  quickly  and 
accurately  extract  information  from  tables. 


Procedures 

Our  starting  model  was  based  on  a  confirmatory  model  of  the  previous  16  subtest  version 
of  the  AFOQT  (Carretta  &  Ree,  1996).  This  model  consisted  of  a  factor  representing  general 
cognitive  ability  (g)  and  five  specific  cognitive  factors  of  verbal,  math,  spatial,  aircrew 
knowledge,  and  perceptual  speed. 


Analyses 

Several goodness-of-fit statistics were considered. Our choice of fit indices was guided in
part by Hu and Bentler (1998, 1999), who recommend using both an incremental fit index and an
absolute fit index to examine model fit. We chose the incremental fit indices of the
Goodness-of-Fit Index (GFI; Tanaka & Huba, 1985), the Adjusted Goodness-of-Fit Index (AGFI;
Jöreskog & Sörbom, 1989), the Comparative Fit Index (CFI; Bentler, 1990, 1995), and the
Non-Normed Fit Index (NNFI; Bentler & Bonett, 1980). The absolute fit indices we examined
were the Standardized Root Mean Square Residual (SRMR; Hu & Bentler, 1999) and the Root
Mean Square Error of Approximation (RMSEA; Browne & Cudeck, 1993). Hu and Bentler (1999)
recommended the following cutoff values as indicators of good model fit: NNFI and CFI of .95
or higher, SRMR of .08 or less, and RMSEA of .06 or less. In addition, previous research has
suggested that a GFI of .95 (Marsh & Grayson, 1995) and an AGFI of .90 (Schermelleh-Engel,
Moosbrugger, & Müller, 2003) reflect acceptable model fit.
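
The cutoffs just listed can be collected into a simple screening rule. The sketch below is plain Python and only a convenience for reading a row of fit statistics against the quoted thresholds; the function name `check_fit` is hypothetical, and the example input values are invented rather than taken from any fitted model. Such a rule is a rough guide, not a substitute for judgment about model fit.

```python
# Cutoffs quoted in the text: (direction, threshold) for each index.
CUTOFFS = {
    "NNFI": (">=", 0.95),
    "CFI": (">=", 0.95),
    "GFI": (">=", 0.95),
    "AGFI": (">=", 0.90),
    "SRMR": ("<=", 0.08),
    "RMSEA": ("<=", 0.06),
}


def check_fit(stats):
    """Return {index: True/False} for whether each statistic meets its cutoff."""
    result = {}
    for name, value in stats.items():
        direction, cutoff = CUTOFFS[name]
        if direction == ">=":
            result[name] = value >= cutoff
        else:
            result[name] = value <= cutoff
    return result


# Invented fit statistics for a hypothetical model.
example = {"RMSEA": 0.05, "NNFI": 0.96, "CFI": 0.97,
           "SRMR": 0.04, "GFI": 0.93, "AGFI": 0.91}
verdict = check_fit(example)
print(verdict)
```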

After estimating the CFA, eigenvalue and eigenvector analyses were conducted to
compare the Form S Pilot and Navigator-Technical composites with the same named composites
from a previous AFOQT form with 16 subtests. AFOQT Forms O, P, and Q had the same 16
subtests and were equated to a common scale. The goal was to assess the first factor saturation of
each composite and to identify the relative contribution of constructs to the composites. Form O
data were used because previous CFAs of the 16 subtest AFOQT were accomplished using Form
O data (Carretta & Ree, 1996).
RESULTS


Table 2 shows the correlation matrix for the 11 subtests on Form S1 (the correlation
matrix for the item parcels is available from the first author). All correlations in Table 2
are positive. The correlations range from .182 (WK and TR) to .706 (AR and MK) with a mean
of .413. These values are similar to those reported for the 16 subtest version, where the
correlations ranged from .17 (WK and EM) to .77 (RC and WK) with a mean of .436 (Carretta &
Ree, 1996). An eigenvalue analysis of the adjusted correlation matrix (principal axis factoring)
showed general cognitive ability, g, accounted for 47 percent of the variance. This was estimated
from the first unrotated principal factor as discussed by Ree and Earles (1991).


Table 2. AFOQT Form S Subtest Correlation Matrix

Subtest  VA     AR     WK     MK     IC     BC     TR     AI     GS     RB     HF

VA       1.000
AR       0.514  1.000
WK       0.691  0.446  1.000
MK       0.422  0.706  0.358  1.000
IC       0.376  0.456  0.310  0.400  1.000
BC       0.350  0.484  0.271  0.410  0.508  1.000
TR       0.249  0.389  0.182  0.309  0.332  0.485  1.000
AI       0.371  0.349  0.363  0.282  0.612  0.357  0.250  1.000
GS       0.561  0.530  0.548  0.566  0.471  0.361  0.210  0.485  1.000
RB       0.353  0.478  0.284  0.423  0.563  0.507  0.306  0.418  0.443  1.000
HF       0.348  0.422  0.270  0.396  0.485  0.492  0.335  0.333  0.377  0.543  1.000

Notes. N = 12,511. All correlations were significant at the p < .01 level.
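
The summary statistics quoted for Table 2 can be checked directly from the matrix, and the first-factor saturation can be approximated. The sketch below, in Python with NumPy, enters the lower triangle of Table 2 and computes the range and mean of the correlations along with the share of total variance carried by the largest eigenvalue. It is a simplification under a stated assumption: it uses the unreduced correlation matrix (ones on the diagonal), a principal-components style calculation, rather than the principal axis factoring of the adjusted matrix described in the text, so the resulting share need not equal the reported 47 percent.

```python
import numpy as np

labels = ["VA", "AR", "WK", "MK", "IC", "BC", "TR", "AI", "GS", "RB", "HF"]

# Lower triangle of Table 2 (diagonal omitted), one list per row.
lower = [
    [],
    [0.514],
    [0.691, 0.446],
    [0.422, 0.706, 0.358],
    [0.376, 0.456, 0.310, 0.400],
    [0.350, 0.484, 0.271, 0.410, 0.508],
    [0.249, 0.389, 0.182, 0.309, 0.332, 0.485],
    [0.371, 0.349, 0.363, 0.282, 0.612, 0.357, 0.250],
    [0.561, 0.530, 0.548, 0.566, 0.471, 0.361, 0.210, 0.485],
    [0.353, 0.478, 0.284, 0.423, 0.563, 0.507, 0.306, 0.418, 0.443],
    [0.348, 0.422, 0.270, 0.396, 0.485, 0.492, 0.335, 0.333, 0.377, 0.543],
]

# Build the full symmetric correlation matrix.
n = len(labels)
R = np.eye(n)
for i, row in enumerate(lower):
    for j, r in enumerate(row):
        R[i, j] = R[j, i] = r

# Range and mean of the 55 distinct correlations.
off_diag = R[np.tril_indices(n, k=-1)]
mean_r = off_diag.mean()

# Share of total variance on the first (largest) eigenvalue.
eigvals = np.linalg.eigvalsh(R)          # ascending order
first_share = eigvals[-1] / eigvals.sum()
print(off_diag.min(), off_diag.max(), round(mean_r, 3), first_share)
```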


Confirmatory  Factor  Analysis 

The fit statistics for the models we examined are shown in Table 3. As shown, the single
factor model fit the data poorly: analysis of the eleven subtests produced an RMSEA of .17 and
analysis of multi-item composites (i.e., parcels) yielded an RMSEA of .15. Both of these
RMSEAs are well above the range that would be considered a good fit (Hu & Bentler, 1999).
Although the four factor model corresponding to the operational composites fit reasonably well,
the bifactor solution with five specific factors fit better. Specifically, the model with a general
factor and five content factors (verbal, math, spatial, aircrew, and perceptual speed) had an
RMSEA of .059, a NNFI of .97, a CFI of .97, and an SRMR of .044 when analyzing the parcels.
Interestingly, the GFI and AGFI indices were noticeably lower for all of the models involving
parcels, even when all of the other fit statistics indicated an excellent fit. Our experience is that
GFI and AGFI are sensitive to model complexity: they tend to be lower in models with more
manifest variables. Thus, we believe that the observed values of GFI and AGFI indicate
satisfactory fits for the five-factor model with correlated factors and the bifactor model with five
specific factors. In sum, similar to Carretta and Ree’s (1996) Model 5, our results indicate that
the data are best represented by a general intelligence factor and five content-specific factors
(verbal, quantitative, spatial, aircrew, and perceptual speed).


Table 3. Fit Statistics for Confirmatory Factor Analysis Models

Model                                                 RMSEA   NNFI   CFI    SRMR    GFI    AGFI

Single Factor Model - Parcels                         0.150   0.89   0.89   0.10    0.49   0.45
Single Factor Model - Subtests                        0.170   0.85   0.88   0.082   0.80   0.71
Four-Factor (Composites) Model - Parcels              0.075   0.96   0.96   0.062   0.79   0.77
Four-Factor Model - Parcels (Phi = Id)                0.086   0.94   0.94   0.24    0.74   0.71
Bifactor with Composite Specific Factors - Parcels    0.065   0.97   0.97   0.05    0.84   0.82
Five-Factor (V, M, Sp, AC, PS) Model - Subtests       0.078   0.97   0.98   0.033   0.97   0.93
Five-Factor (V, M, Sp, AC, PS) - Parcels              0.059   0.97   0.97   0.044   0.86   0.84
Five-Factor Model - Parcels (Phi = Id)                0.072   0.96   0.96   0.22    0.80   0.79
Bifactor with Composite Specific Factors - Parcels    0.053   0.98   0.98   0.057   0.88   0.87

Note. Phi = Id indicates that the factor correlation matrix $\Phi$ was restricted to an identity
matrix, i.e., factors were orthogonal.
The parameter estimates of the bifactor model are also informative. Although all
indicators had substantial loadings on both the general and their specific factors, the majority had
their strongest loading on a specific dimension. Notable exceptions included the Block Counting,
Rotated Blocks, and Hidden Figures subtests, which had their highest loadings on the general
factor. This suggests that a large portion of the variance in the general factor is accounted for by
spatial ability. Nonetheless, most indicators still had substantial loadings on the general factor
with estimates ranging from .28 (WK2) to .70 (BC3).

We analyzed the data with bifactor models where each observed variable loaded on the
general factor and one content factor and, to enhance the comparability of our results to those of
Carretta and Ree (1996), we obtained solutions where the Block Counting subtest was allowed to
load on the general, spatial, and perceptual speed factors and the General Science subtest was
allowed to load on the general, verbal, and aircrew factors. Results were very similar for these
two types of models and, consequently, Table 3 presents fit statistics for the analyses parallel to
Carretta and Ree.

When we allowed cross-loadings, the Block Counting subtest had negative loadings on
the spatial dimension and positive loadings on the perceptual speed dimension. The negative
loadings may be a result of the magnitude of this subtest’s relationship with the general factor.
Factor loadings for this model are shown in Table 4.
Table 4. Completely Standardized Solution for the Bifactor Model with Five Specific Factors

Parcel   General   Verbal    Quantitative   Spatial   Aircrew   Perceptual Speed

VA1      .39       .46       —              —         —         —
VA2      .31       .48       —              —         —         —
VA3      .33       .56       —              —         —         —
VA4      .33       .31       —              —         —         —
VA5      .39       .53       —              —         —         —
AR1      .52       —         .46            —         —         —
AR2      .52       —         .49            —         —         —
AR3      .50       —         .51            —         —         —
AR4      .49       —         .52            —         —         —
AR5      .47       —         .46            —         —         —
WK1      .29       .59       —              —         —         —
WK2      .28       .61       —              —         —         —
WK3      .30       .67       —              —         —         —
WK4      .31       .70       —              —         —         —
WK5      .33       .70       —              —         —         —
MK1      .39       —         .60            —         —         —
MK2      .45       —         .57            —         —         —
MK3      .40       —         .52            —         —         —
MK4      .45       —         .57            —         —         —
MK5      .48       —         .56            —         —         —
IC1      .59       —         —              —         .58       —
IC2      .56       —         —              —         .57       —
IC3      .62       —         —              —         .58       —
IC4      .58       —         —              —         .53       —
BC1      .69       —         —              -.33      —         .06
BC2      .66       —         —              -.28      —         .09
BC3      .70       —         —              -.31      —         .11
BC4      .67       —         —              -.29      —         .12
TR1      .37       —         —              —         —         .67
TR2      .39       —         —              —         —         .65
TR3      .40       —         —              —         —         .72
TR4      .38       —         —              —         —         .69
TR5      .39       —         —              —         —         .66
TR6      .34       —         —              —         —         .65
TR7      .36       —         —              —         —         .70
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Table  4,  Completely  Standardized  Solution  for  the  Bifactor  Model  with  Five  Specific 


Factors,  continued 

Parcel 

General 

Verbal 

Quantitative  Spatial 

Perceptual 
Aircrew  Speed 

TR8 

.36 

— 

— 

— 

— 

.71 

A1  1 

.42 

— 

— 

— 

.47 

— 

A12 

.34 

— 

— 

— 

.49 

— 

A13 

.43 

— 

— 

— 

.47 

— 

A14 

.34 

— 

— 

— 

.48 

— 

GS  1 

.41 

.32 

— 

— 

.08 

— 

GS2 

.33 

.36 

— 

— 

.13 

— 

GS3 

.45 

.35 

— 

— 

.19 

— 

GS4 

.41 

.30 

— 

— 

.20 

— 

RB  1 

.65 

— 

— 

.09 

— 

— 

RB2 

.61 

— 

— 

.08 

— 

— 

RB  3 

.57 

— 

— 

.08 

— 

— 

HF  1 

.66 

— 

— 

.41 

— 

— 

HF2 

.68 

— 

— 

.40 

— 

— 

HF3 

.67 

_ 

_ 

.37 

_ 

_ 
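
As a reading aid for Table 4, the following minimal sketch shows how completely standardized loadings translate into variance shares. It assumes that the general and specific factors are mutually orthogonal, which is the usual bifactor specification but is not stated explicitly in this section; under that assumption a parcel's communality is the sum of its squared loadings. The function name and the choice of parcels are illustrative only; the loadings themselves are copied from Table 4.

```python
# Completely standardized loadings copied from Table 4:
# parcel -> (general loading, list of specific-factor loadings).
LOADINGS = {
    "WK4": (0.31, [0.70]),          # verbal
    "BC1": (0.69, [-0.33, 0.06]),   # spatial, perceptual speed
    "TR3": (0.40, [0.72]),          # perceptual speed
    "HF1": (0.66, [0.41]),          # spatial
}


def variance_shares(general, specifics):
    """Return (communality, share of explained variance due to the general factor).

    Assumes orthogonal factors (an assumption, not taken from the report),
    so squared standardized loadings add up to the explained variance.
    """
    general_var = general ** 2
    specific_var = sum(loading ** 2 for loading in specifics)
    communality = general_var + specific_var
    return communality, general_var / communality


for parcel, (general, specifics) in LOADINGS.items():
    communality, general_share = variance_shares(general, specifics)
    print(f"{parcel}: communality = {communality:.2f}, "
          f"general-factor share = {general_share:.2f}")
```

Under these assumptions, most of the explained variance in the Block Counting parcel BC1 comes from the general factor, which is in line with the remark above about that subtest's strong relationship with the general factor.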

Measurement  Invariance 

Carretta and Ree (1995) investigated the invariance of factor loadings for an earlier form
of the 16-subtest AFOQT using a sample of 269,968 applicants for U.S. Air Force commissions
who were tested between 1981 and 1993. They compared males (N = 219,887) and females (N =
50,081) and also compared Black (N = 32,798), Hispanic (N = 12,647), Asian-American (N =
9,460), and Native-American (N = 2,551) groups to Whites (N = 212,238). Given these very large
sample sizes, it is not surprising that they found statistically significant differences in factor
loadings. More importantly, however, they found that the differences in factor loadings were very
small in size (generally less than .05), indicating that the tests functioned equivalently across
groups.
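
The gap between statistical and practical significance at these sample sizes can be illustrated with a rough calculation. The sketch below applies the Fisher r-to-z test for two independent correlations to the male and female sample sizes quoted above. It is only an analogy: Carretta and Ree compared factor loadings within a multiple-group model rather than simple correlations, and the two values .50 and .47 are hypothetical.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error


def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Sample sizes from the text; the correlations .50 and .47 are hypothetical.
z = fisher_z_difference(0.50, 219_887, 0.47, 50_081)
print(f"z = {z:.2f}, two-sided p = {two_sided_p(z):.1e}")
```

With samples this large, a difference of .03 is far beyond conventional significance thresholds even though it is too small to matter in practice.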

Following Carretta and Ree (1995), we examined the measurement equivalence of the
bifactor model with five content factors because it fit the best in the total sample. We also tested
the equivalence of the four-factor model because of its operational use by the Air Force. Table 5
shows the results of the MACS analyses using a factor pattern matrix based on the current
operational composites with correlated factors, and Table 6 gives the results for the bifactor
structure with a general factor and the five content factors described above. Similar to Carretta
and Ree (1996), the Block Counting and General Science subtests were allowed to cross-load on
additional specific factors in the bifactor model. Both Tables 5 and 6 show only negligible
changes in the fit indices when the constrained models are compared to the baseline (i.e.,
configural) solution.

Table 5. Fit Statistics for the MACS Analyses of the Four-Factor Structure

Scalar Invariance  0.074  0.95  0.95  0.066  0.086  0.086  0.081


Table  6.  Fit  Statistics  for  the  MACS  Analyses  of  the  Bifactor  Structure 

In the analysis of measurement invariance for race, RMSEA did not change at all
from the configural invariance model to the scalar invariance model: it was
.054 for both, as shown in Table 6 for the model with a general factor and five content
factors. The NNFI and CFI measures also showed no change. For the individual groups,
moderate changes in SRMRs were observed for African Americans, Hispanics, and
Others, probably because these samples were much smaller than the White sample.

A similar pattern of results is apparent for the male-female comparison. Table 6
shows that the RMSEA had a trivial increase (.054 to .055), and the NNFI and CFI did
not change at all. The male SRMR did not change, and the female SRMR increased
moderately, probably because this sample was much smaller than the male sample. In
sum, the results suggest that there is little or no differential item and test functioning
across minority and majority groups.
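
The comparisons in the two paragraphs above amount to checking how much a fit index changes between the configural and scalar models. A minimal sketch follows. The RMSEA values are those quoted in the text for the bifactor structure; the cutoff of .015 for a change in RMSEA is a rule of thumb from the invariance literature, not a criterion stated in this report, and the names used are hypothetical.

```python
# Assumed rule-of-thumb cutoff for a change in RMSEA (not from this report).
RMSEA_CHANGE_CUTOFF = 0.015

# RMSEA values quoted in the text for the bifactor structure (Table 6).
RMSEA = {
    "race": {"configural": 0.054, "scalar": 0.054},
    "gender": {"configural": 0.054, "scalar": 0.055},
}


def rmsea_change(fit):
    """Change in RMSEA from the configural to the scalar model, to 3 decimals."""
    return round(fit["scalar"] - fit["configural"], 3)


for comparison, fit in RMSEA.items():
    delta = rmsea_change(fit)
    verdict = "negligible" if delta <= RMSEA_CHANGE_CUTOFF else "notable"
    print(f"{comparison}: change in RMSEA = {delta:+.3f} ({verdict})")
```

The same check could be applied to NNFI and CFI, which the text reports as unchanged.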

Comparison  of  Form  S  Pilot  and  Navigator/Technical  Composites  with  Previous  Forms 

Eigenvalue and eigenvector analyses of the AFOQT Form S Pilot composite
showed only one eigenvalue over 1.0, accounting for 53% of the variance. A similar result
was found for the Navigator/Technical composite, with only one eigenvalue over 1.0
accounting for 55% of the variance.

The  same  analyses  for  the  previous  AFOQT  Form  O  showed  that  the  first 
eigenvalue  for  the  Pilot  composite  accounted  for  50%  of  the  variance  in  the  matrix  and  a 
second  eigenvalue  over  1.0  accounted  for  13%.  The  Form  O  Navigator/Technical 
composite  had  one  eigenvalue  over  1.0  that  accounted  for  52%  of  the  variance. 
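
The eigenvalue analysis described above can be sketched as follows: compute the eigenvalues of a composite's subtest correlation matrix, count those above 1.0 (often called the Kaiser criterion), and express the first eigenvalue as a share of the total variance, which for a correlation matrix equals the number of subtests. The correlation matrix below is hypothetical, six subtests with a uniform intercorrelation of .44, chosen only so the arithmetic is easy to check; it is not AFOQT data. The Jacobi rotation routine is a generic textbook method used to keep the example free of dependencies, not the procedure the authors used.

```python
import math


def symmetric_eigenvalues(matrix, max_sweeps=100, tol=1e-18):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off_diagonal = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off_diagonal < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Angle that zeroes a[p][q]: tan(2*theta) = 2*a[p][q] / diff.
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def kaiser_summary(corr):
    """Return (number of eigenvalues above 1.0, variance share of the first)."""
    eigenvalues = symmetric_eigenvalues(corr)
    n_above_one = sum(1 for value in eigenvalues if value > 1.0)
    return n_above_one, eigenvalues[0] / len(corr)


# Hypothetical correlation matrix: 6 subtests, every intercorrelation .44.
n_subtests, r = 6, 0.44
corr = [[1.0 if i == j else r for j in range(n_subtests)] for i in range(n_subtests)]
n_above_one, first_share = kaiser_summary(corr)
print(f"eigenvalues over 1.0: {n_above_one}; first accounts for {first_share:.0%}")
```

For a matrix with uniform correlation r among p variables, the first eigenvalue is 1 + (p - 1)r and the rest are 1 - r, so this hypothetical case has a single eigenvalue over 1.0.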


DISCUSSION


In  this  study,  a  variety  of  confirmatory  factor  analysis  models  were  fit  to  data 
from  the  recently  revised  AFOQT.  The  results  and  conclusions  are  strikingly  similar  to 
those  obtained  by  Carretta  and  Ree  (1996):  an  important  general  cognitive  ability  factor 
underlies  performance  on  all  of  the  subtests  and  verbal,  mathematical  reasoning,  spatial, 
aircrew,  and  perceptual  speed  factors  underlie  groups  of  subtests.  Moreover,  excellent  fits 
were  obtained,  so  we  can  have  confidence  in  these  findings. 

Mean  and  covariance  structure  analysis  was  used  to  investigate  measurement 
invariance  for  the  AFOQT.  Very  positive  results  were  obtained  in  that  the  overall  fit 
statistics  for  the  most  restrictive  models  (i.e.,  models  specifying  scalar  invariance)  were 
nearly  the  same  size  as  the  fit  statistics  for  the  least  restrictive  models  (i.e.,  models 
specifying  configural  invariance).  This  indicates  that  AFOQT  scores  can  be  used  to  make 
comparisons  across  candidates  irrespective  of  their  gender  or  race. 

Operationally, personnel measurement, selection, and classification decisions
involving the AFOQT are based on composite scores. The Pilot and Navigator/Technical
composites are part of the system for aircrew training qualification, including pilots,
combat system operators, and air battle managers. Therefore, the nature and performance
of these composites are very important. Comparison of the composite scores and underlying
structure of the prior Forms O, P, and Q with those of the current Form S is informative.

On the first eigenvector for the Form S Pilot composite, Arithmetic Reasoning had the
greatest loading (largest eigenvector value) at .502 and Table Reading showed the smallest at
.362. The results for the first factor from the Form O Pilot composite showed high
eigenvector values for the perceptual speed, spatial, and aviation job knowledge subtests
of Scale Reading (.795), Block Counting (.785), Mechanical Comprehension (.750), and
Instrument Comprehension (.742). The magnitudes of the Form S loadings are much
smaller than the loadings for Form O. Further, Form S has subtests that are more
indicative of g than Form O. The subtests on the Form O Pilot composite all share the
characteristic that the male means are noticeably greater than the female means. This is
not the case in Form S. The newly implemented Form S should therefore be expected to have
smaller male-female differences than Form O.

For the Form S Navigator/Technical composite, the first factor accounted for 55%
of the variance, with Arithmetic Reasoning again showing the greatest loading at .512 and
Table Reading the lowest at .304. High-level findings for Form O were similar. The first
factor accounted for 52% of the variance. The highest loading on the first factor was
Scale Reading (.815), followed by the three mathematics tests: Arithmetic Reasoning
(.807), Data Interpretation (.766), and Math Knowledge (.788). Electrical Maze (.612)
showed the lowest value. Mechanical Comprehension, Data Interpretation, Scale
Reading, and Electrical Maze were all removed from Form S. Further, Rotated Blocks
and Hidden Figures were removed from the Navigator/Technical composite. The Form S
Navigator/Technical composite is composed of four subtests that are good indicators of g:
Verbal Analogies, Arithmetic Reasoning, Math Knowledge, and Block Counting. It also
includes a marker for perceptual speed in Table Reading, the only speeded subtest on the
AFOQT, and General Science, which has loadings on verbal and aircrew factors. On both
Form S and Form O there was no second eigenvalue equal to or above 1 for the
Navigator/Technical composite. Given the change in the content of Form S, a difference
in validity might be expected. However, the presence of the highly g-loaded subtests
suggests otherwise.

In a series of papers, Ree and colleagues (Olea & Ree, 1994; Ree, Carretta, &
Teachout, 1996; Ree & Earles, 1991; Ree, Earles, & Teachout, 1994) showed that
psychometric g was largely responsible for predicting performance in training and on the
job for a wide variety of military samples. It appears that the current form of the AFOQT
taps psychometric g and would be expected to predict performance in ways similar to
previous forms.